Using Word and Phrase Abbreviation Patterns to Extract Age From Twitter Microtexts
Abstract
The wealth of text publicly available online for analysis is ever increasing. Much work in computational linguistics focuses on syntactic, contextual, morphological, and phonetic analysis of written documents, voice recordings, or texts on the internet. Twitter messages present a unique challenge for computational linguistic analysis because of their constrained size. The 140-character limit often prompts users to abbreviate words and phrases. Additionally, because Twitter is an informal writing medium, messages are not expected to adhere to grammatically or orthographically standard English. As such, Twitter messages are noisy and do not necessarily conform to the standard writing conventions of linguistic corpora, often requiring special pre-processing before advanced analysis can be done.
In computational linguistics, there is an interest in determining latent attributes of an author. Attributes such as author gender can be determined with considerable success from many sources, using methods such as shallow linguistic patterns or topic analysis. Author age is more difficult to determine, but previous research has been fairly successful at classifying age as a binary, ternary, or even continuous variable using various techniques. Twitter messages present a difficult problem for latent user attribute analysis because of the pre-processing necessary for many computational linguistics tasks. An added logistical challenge is that very few latent attributes are explicitly defined by users on Twitter. Twitter messages are part of an enormous data set, but that data set must be independently annotated for latent writer attributes before any classification on attributes not defined through the Twitter API can be done. The classification problem itself is particularly challenging because of the restrictions on tweets.
Previous work has shown that word and phrase abbreviation patterns used on Twitter can be indicative of some latent user attributes, such as geographic region or the Twitter client used to make posts. Language changes tend to be driven by youths, often in adolescence, whose usage tends to gradually influence older authors. Older language users either adopt the changes or resist them. Twitter is a relatively new service. This study explores whether there are longitudinal patterns evident in 6 years of online interactions. I propose the development of a large, growable data set annotated by Twitter users themselves for age and other useful attributes. I also propose an extension of prior work on Twitter abbreviation patterns to determine whether these linguistic patterns are indicative of author age at time of posting and can be useful in determining an author’s age, and to investigate what changes, if any, occur in these patterns over time. Lastly, I propose a timeline for thesis completion with deliverables.